Abstract

Stochastic games are an important class of problems that generalize Markov decision processes to 
game theoretic scenarios. We consider finite state two-player zero-sum stochastic games over an infinite 
time horizon with discounted rewards. The players are assumed to have infinite strategy spaces and 
the payoffs are assumed to be polynomials. In this paper we restrict our attention to a special class of 
games for which the single-controller assumption holds. It is shown that minimax equilibria and optimal 
strategies for such games may be obtained via semidefinite programming. 

I. Introduction 

Markov decision processes (MDPs) are very widely used system modeling tools where a 
single agent attempts to make optimal decisions at each stage of a multi-stage process so as to 
optimize some reward or payoff [1]. Game theory is a system modeling paradigm that allows
one to model problems where several (possibly adversarial) decision makers make individual
decisions to optimize their own payoff [2]. In this paper we study stochastic games [3], a
framework that combines the modeling power of MDPs and games. Stochastic games may be 
viewed as competitive MDPs where several decision makers make decisions at each stage to 
maximize their own reward. Each state of a stochastic game is a simple game, but the decisions 
made by the players affect not only their current payoff, but also the transition to the next state. 

This research was funded in part by AFOSR MURI subawards 2003-07688-1 and 102-1080673.

Notions of solutions in games have been extensively studied, and are very well understood.
The most popular notion of a solution in game theory is that of a Nash equilibrium. While these 
equilibria are hard to compute in general, in certain cases they may be computed efficiently. 
For games involving two players and finite action spaces, mixed strategy minimax equilibria 
always exist (see, e.g., lIH). These minimax saddle points correspond to the well-known notion 
of a Nash equilibrium. From a computational standpoint such games are considered tractable 
because Nash equilibria may be computed efficiently via linear programming. Stochastic games 
were introduced by Shapley [4] in 1953. In his paper, he showed that the notion of a minimax
equilibrium may be extended to stochastic games with finite state spaces and strategy sets. He
also proposed a value iteration-like algorithm to compute the equilibria. In 1981 Parthasarathy
and Raghavan [5] studied single controller games. Single controller games are games where
the probabilities of transitions are controlled by the action of only one player. They showed 
that stochastic games satisfying this property could be solved efficiently via linear programming 
(thus proving that such problems with rational data could be computed in a finite number of 
steps). 

While computational techniques for finite games are reasonably well understood, there has 
been some recent interest in the class of infinite games; see [6], [7] and the references therein.
In this important class, players have access to an infinite number of pure strategies, and the
players are allowed to randomize over these choices. In a recent paper [6], Parrilo describes a
technique to solve two-player, zero-sum infinite games with polynomial payoffs via semidefinite
programming. It is natural to wonder whether the techniques from finite stochastic games can 
be extended to infinite stochastic games (i.e. finite state stochastic games where players have 
access to infinitely many pure strategies). In particular, since finite, single-controller, zero-sum 
games can be solved via linear programming, can similar infinite stochastic games be solved 
via semidefinite programming? The answer is affirmative, and this paper focuses on establishing 
this result. 

The main contribution of this paper is to provide a computationally efficient, finite dimensional 
characterization of the solution of single-controller polynomial stochastic games. For this, we 
extend the linear programming formulation that solves the finite action single-controller stochastic 
game (i.e., under assumption (SC) below), to an infinite dimensional optimization problem when 
the actions are uncountably infinite. We furthermore establish the following properties of this
infinite dimensional optimization problem:

1) Its optimal solutions correspond to minimax equilibria. 

2) The problem can be solved efficiently by semidefinite programming. 

Section II of this paper provides a formal description of the problem and introduces the basic
notation used in the paper. We show that for two-player zero-sum polynomial stochastic games,
equilibria exist and that the corresponding equilibrium value vector is unique. (This proof is
essentially an adaptation of the original proof by Shapley in [4] for finite stochastic games.) In
Section II we also briefly review some elegant results about polynomial nonnegativity, moment
sequences of nonnegative measures, and their connection to semidefinite programming. In Section III
we briefly review the linear programming approach to finite stochastic games. Section IV
states and proves the main result of this paper. In Section V we present an example of a two-player,
two-state stochastic game, and compute the equilibria via semidefinite programming.
Finally, in Section VI we state some natural extensions of this problem, conclusions, and
directions of future research.

II. Problem description 

A. Stochastic games 

We consider the problem of solving two-player zero-sum stochastic games via mathematical
programming. The game consists of finitely many states with two adversarial players that make 
simultaneous decisions. Each player receives a payoff that depends on the actions of both players 
and the state (i.e. each state can be thought of as a particular zero-sum game). The transitions 
between the states are random (as in a finite state Markov decision process), and the transition 
probabilities in general depend on the actions of the players and the current state. The process 
runs over an infinite horizon. Player 1 attempts to maximize his reward over the horizon (via 
a discounted accumulation of the rewards at each stage) while player 2 tries to minimize his 
payoff to player 1. If $(a_1^1, a_1^2, \ldots)$ and $(a_2^1, a_2^2, \ldots)$ are the sequences of actions chosen by players 1
and 2, respectively, resulting in a sequence of states $(s_1, s_2, \ldots)$, then the reward of player 1 is
given by:

\[ \sum_{k=1}^{\infty} \beta^k \, r(s_k, a_1^k, a_2^k). \]

The game is completely defined via the specification of the following data:

1) The (finite) state space $\mathcal{S} = \{1, \ldots, S\}$.

2) The sets of actions for players 1 and 2, given by $A_1$ and $A_2$.

3) The payoff function, denoted by $r(s, a_1, a_2)$, for a given state $s$ and actions $a_1$ and
$a_2$ (of players 1 and 2).

4) The probability transition matrix $p(s'; s, a_1, a_2)$, which provides the conditional probability
of transition from state $s$ to $s'$ given the players' actions.

5) The discount factor $\beta$, where $0 \le \beta < 1$.

Fig. 1. A two state stochastic game. The payoff functions associated to the states are denoted by $r_1$ and $r_2$. The edges are
marked by the corresponding state transition probabilities.

To fix ideas, consider the following example of a two-state stochastic game (i.e. $\mathcal{S} = \{1, 2\}$).
The action spaces of the two players are $A_1 = A_2 = [0, 1]$. The payoff function in state 1 is
$r(1, a_1, a_2) = r_1(a_1, a_2)$ and the payoff function in state 2 is given by $r(2, a_1, a_2) = r_2(a_1, a_2)$.
Both are assumed to be polynomials in $a_1$ and $a_2$. The probability transition matrix is:

\[ P = \begin{bmatrix} p_{11}(a_1, a_2) & p_{12}(a_1, a_2) \\ p_{21}(a_1, a_2) & p_{22}(a_1, a_2) \end{bmatrix}. \]

Every entry in this matrix is assumed to be a polynomial in $a_1$ and $a_2$. This stochastic game can
be depicted graphically as shown in Fig. 1. We will return to a specific instance of this example
in Section V, where we explicitly solve for the equilibrium strategies of the two players.

Through most of this paper (except Section II-C) we make the following important assumption
about the probability transition matrix:

Assumption SC

The probability of transition to state $s'$ conditioned upon the current state being $s$ depends only on
$s$, $s'$, and the action $a_1$ of player 1 for every $s$ and $s'$. This probability is independent of the action
of player 2. Thus, $p(s'; s, a_1, a_2) = p(s'; s, a_1)$. This is known as the single-controller assumption.

In this paper we will mostly (except briefly, in Section III, where finite strategy spaces are
considered) be concerned with the case where the action spaces $A_1$ and $A_2$ of the two players
are uncountably infinite sets. For the sake of simplicity we will often consider the case where
$A_1 = A_2 = [0, 1] \subset \mathbb{R}$. The results easily generalize to the case where the strategy sets are
finite unions of arbitrary intervals of the real line. For the sake of simplicity, we also assume
that the action sets are the same for each state, though this assumption may be relaxed. We will
denote by $a_1$ and $a_2$ the actual actions chosen by players 1 and 2 from their respective action
spaces. The payoff function is assumed to be a polynomial in the variables $a_1$ and $a_2$ with real
coefficients:

\[ r(s, a_1, a_2) = \sum_{i=1} \sum_{j=1} r_{ij}(s) \, a_1^i a_2^j. \]

Finally, we assume that the transition probability $p(s'; s, a_1)$ is a polynomial in the action $a_1$.
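To make the polynomial single-controller model concrete, the following Python sketch stores each transition probability $p(s'; s, a_1)$ as a list of coefficients in $a_1$ and checks two necessary properties: every row of the transition matrix sums to the constant polynomial 1, and every entry is nonnegative at sampled actions in $[0, 1]$. The two-state polynomials, the dictionary layout, and the function names are hypothetical choices made for illustration and are not taken from the paper. The grid test is only a necessary check, not a certificate of nonnegativity.

```python
from fractions import Fraction as F

def poly_eval(coeffs, x):
    """Evaluate sum_k coeffs[k] * x**k by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def poly_sum(polys):
    """Coefficient-wise sum of a list of polynomials."""
    n = max(len(p) for p in polys)
    return [sum(p[k] for p in polys if k < len(p)) for k in range(n)]

def rows_sum_to_one(P):
    """True if, for every state s, sum_t p(t; s, a1) is identically 1."""
    for s in P:
        total = poly_sum(list(P[s].values()))
        if total[0] != 1 or any(c != 0 for c in total[1:]):
            return False
    return True

def nonneg_on_grid(coeffs, steps=100):
    """Necessary check only: p(a1) >= 0 at grid points of [0, 1]."""
    return all(poly_eval(coeffs, F(k, steps)) >= 0 for k in range(steps + 1))

# Hypothetical model: P[s][t] holds the coefficients (in a1) of p(t; s, a1).
P = {
    1: {1: [0, 1], 2: [1, -1]},        # p(1;1,a1) = a1,   p(2;1,a1) = 1 - a1
    2: {1: [0, 0, 1], 2: [1, 0, -1]},  # p(1;2,a1) = a1^2, p(2;2,a1) = 1 - a1^2
}

valid = rows_sum_to_one(P) and all(
    nonneg_on_grid(P[s][t]) for s in P for t in P[s]
)
```

Exact rational arithmetic is used so that the row-sum test is an identity between coefficient vectors rather than a floating-point comparison.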

The decision process runs over an infinite horizon, thus it is natural to restrict one's attention 
to stationary strategies for each player, i.e. strategies that depend only on the state of the process 
and not on time. Moreover, since the process involves two adversarial decision makers, it is also 
natural to look for randomized strategies (or mixed strategies) rather than pure strategies so as 
to recover the notion of a minimax equilibrium. A mixed strategy for player 1 is a finite set
of probability measures $\boldsymbol{\mu} = [\mu(1), \ldots, \mu(S)]$ supported on the action set $A_1$. Each probability
measure corresponds to a randomized strategy for player 1 in some particular state; for example,
$\mu(k)$ corresponds to the randomized strategy that player 1 would use when in state $k$. Similarly,
player 2's strategy will be represented by $\boldsymbol{\nu} = [\nu(1), \ldots, \nu(S)]$. (A word on notation: Throughout
the paper, indices in parentheses will be used to denote the state. Bold letters will be used to indicate
vectorization with respect to the state, i.e., collection of objects corresponding to different states
into a vector with the $i$-th entry corresponding to state $i$. The Greek letters $\xi$, $\mu$, $\nu$ will be
used to denote measures. Subscripts on these Greek letters will be used to denote moments
of the measures. A bar over a Greek letter indicates a (finite) moment sequence (the length of
the sequence being clear from the context). For example, $\xi_j(i)$ denotes the $j$-th moment of the
measure $\xi$ corresponding to state $i$, and $\bar{\xi}(i) = [\xi_0(i), \ldots, \xi_n(i)]$.)

A strategy $\boldsymbol{\mu}$ leads to a probability transition matrix $P(\boldsymbol{\mu})$ such that $P_{ij}(\boldsymbol{\mu}) = \int_{A_1} p(j; i, a_1) \, d\mu(i)$.
Thus, once player 1 fixes a strategy $\boldsymbol{\mu}$, the probability transition matrix is fixed, and can be
obtained by integrating each entry in the matrix with respect to the measure $\mu(i)$ of the corresponding state. (Since the entries
are polynomials, upon integration, these entries depend affinely on the moments $\bar{\mu}(i)$.) Given
strategies $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$, the expected reward collected by player 1 in some stage $s$ is given by:

\[ r(s, \mu(s), \nu(s)) = \int_{A_1} \int_{A_2} r(s, a_1, a_2) \, d\mu(s) \, d\nu(s). \]

The reward collected over the infinite horizon (for fixed strategies $\mu(s)$ and $\nu(s)$) starting at
state $s$, $v_\beta(s, \mu(s), \nu(s))$, is given by the system of equations:

\[ v_\beta(s, \mu(s), \nu(s)) = r(s, \mu(s), \nu(s)) + \beta \sum_{s' \in \mathcal{S}} \left( \int_{A_1} p(s'; s, a_1) \, d\mu(s) \right) v_\beta(s', \mu(s'), \nu(s')) \quad \forall s. \]

Vectorizing $v_\beta(s, \mu(s), \nu(s))$, we obtain

\[ \mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu}) = (I - \beta P(\boldsymbol{\mu}))^{-1} \, \mathbf{r}(\boldsymbol{\mu}, \boldsymbol{\nu}), \]

where $\mathbf{r}(\boldsymbol{\mu}, \boldsymbol{\nu}) = [r(1, \mu(1), \nu(1)), \ldots, r(S, \mu(S), \nu(S))] \in \mathbb{R}^S$.
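The formulas above can be traced numerically. The sketch below, written in Python with exact rational arithmetic, builds the transition matrix and the stage rewards from moments of the strategies and then solves the linear system for the value vector of that fixed strategy pair. The two-state game data, the choice of the uniform measure on $[0, 1]$ for both players in both states, and all function names are assumptions made for this illustration only.

```python
from fractions import Fraction as F

def integrate(coeffs, moments):
    """Integral of sum_k c_k a^k against a measure with the given moments."""
    return sum(c * moments[k] for k, c in enumerate(coeffs))

def expected_reward(r_coeffs, mu, nu):
    """Integral of sum_{i,j} r_ij a1^i a2^j against the product measure."""
    return sum(c * mu[i] * nu[j] for (i, j), c in r_coeffs.items())

def solve(A, b):
    """Solve A v = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = next(i for i in range(c, n) if M[i][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for i in range(n):
            if i != c:
                M[i] = [x - M[i][c] * y for x, y in zip(M[i], M[c])]
    return [M[i][n] for i in range(n)]

beta = F(1, 2)
states = [1, 2]
# Moments 0..2 of the uniform measure on [0, 1]: mu_k = 1 / (k + 1).
uniform = [F(1, k + 1) for k in range(3)]
mu = {1: uniform, 2: uniform}
nu = {1: uniform, 2: uniform}

# Hypothetical data: p[s][t] are coefficients in a1 of p(t; s, a1);
# r[s] maps exponent pairs (i, j) to the coefficient r_ij(s).
p = {1: {1: [0, 1], 2: [1, -1]},
     2: {1: [0, 0, 1], 2: [1, 0, -1]}}
r = {1: {(1, 1): 1}, 2: {(1, 2): 1}}

# P(mu) depends affinely on the moments of player 1's measures.
P = [[integrate(p[s][t], mu[s]) for t in states] for s in states]
r_vec = [expected_reward(r[s], mu[s], nu[s]) for s in states]

# v = (I - beta P)^{-1} r
A = [[(1 if i == j else 0) - beta * P[i][j] for j in range(2)] for i in range(2)]
v = solve(A, r_vec)
```

Only the moments of the measures enter the computation, which is the property that later allows the problem to be posed over moment sequences instead of measures.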



B. Solution Concept 

We now briefly discuss the question: "What is a reasonable solution concept for stochastic 
games?" Recall that for zero-sum normal form games, a Nash equilibrium is a widely used 
notion of equilibrium in competitive scenarios. A Nash equilibrium in a two-player game is a
pair of independent randomized strategies (say $\mu$ and $\nu$, one for each player) such that, given
that player 2 plays $\nu$, player 1's best response is to play $\mu$, and vice versa. It is an easy
exercise to show that computing Nash equilibria is equivalent to finding saddle points of the payoff
function. It is also well known that Nash equilibria (or equivalently saddle points) correspond
to the minimax notion of an equilibrium, i.e. points that satisfy the following equality:

\[ \min_{\mu} \max_{\nu} f(\mu, \nu) = \max_{\nu} \min_{\mu} f(\mu, \nu). \]

While there may exist no pure strategies that satisfy this equality, it may be achieved by allowing 
randomization over the allowable strategies. 
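A small finite example illustrates why randomization is needed. The Python sketch below uses the classical matching pennies matrix game (a standard textbook example, not one from this paper): over pure strategies the maxmin and minmax values differ, while the uniform mixed strategies give both players the same guaranteed value. The function names are illustrative.

```python
from fractions import Fraction

# Matching pennies: payoff to the row player (the maximizer).
A = [[1, -1],
     [-1, 1]]

def pure_maxmin(A):
    """Best payoff the row player can guarantee with a pure strategy."""
    return max(min(row) for row in A)

def pure_minmax(A):
    """Smallest payoff the column player can concede with a pure strategy."""
    cols = list(zip(*A))
    return min(max(col) for col in cols)

def expected_payoff(A, x, y):
    """Expected payoff x^T A y for mixed strategies x (rows) and y (columns)."""
    return sum(x[i] * A[i][j] * y[j]
               for i in range(len(A)) for j in range(len(A[0])))

half = Fraction(1, 2)
x_mix = [half, half]
y_mix = [half, half]
pure = ([1, 0], [0, 1])

# It suffices to test pure replies, since the payoff is linear in each mix.
row_guarantee = min(expected_payoff(A, x_mix, e) for e in pure)
col_guarantee = max(expected_payoff(A, e, y_mix) for e in pure)
```

Here the pure-strategy values are -1 and 1, so no pure saddle point exists, whereas both mixed guarantees coincide at 0.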

In his seminal paper [4], Shapley generalized the notion of Nash equilibria to stochastic games.
He defined the notion of a "stationary equilibrium" to be a pair of randomized strategies (over
the action space) that depend only on the state of the game. (Of course, to be an equilibrium,
these mixed strategies must also satisfy the no-deviation principle.) For stochastic games, once
one restricts attention to stationary equilibria, instead of having a unique "value" (as in normal
form games), one has a unique "value vector". This vector is indexed by the state and the $i$-th
component is interpreted as the equilibrium value player 1 can expect to receive (over the infinite
discounted process) conditioned on the fact that the game starts in state $i$. Note that different
states of the game may be favorable to different players. Since the actions affect both payoffs
and state transitions, players must balance their strategies so that they receive good payoffs in
a particular state along with favorable state transitions. The "no unilateral deviation" principle,
the saddle point inequality (interpreted row-wise, i.e., conditioned upon a particular state) and the
equivalence of the minmax and maxmin over randomized strategies all extend to the stochastic
game case, and when we restrict attention to games with just one state, we recover the classical
notions of equilibrium.

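Definition 1: A pair of vectors of mixed strategies (indexed by the state) $\boldsymbol{\mu}^0$ and $\boldsymbol{\nu}^0$ which
satisfy the saddle point property:

\[ \mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu}^0) \le \mathbf{v}_\beta(\boldsymbol{\mu}^0, \boldsymbol{\nu}^0) \le \mathbf{v}_\beta(\boldsymbol{\mu}^0, \boldsymbol{\nu}) \qquad (1) \]

for all (vectors of) mixed strategies $\boldsymbol{\mu}, \boldsymbol{\nu}$ are called equilibrium strategies. The corresponding
vector $\mathbf{v}_\beta(\boldsymbol{\mu}^0, \boldsymbol{\nu}^0)$ is called the value vector of the game.

One should note that $\mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu})$ is a vector in $\mathbb{R}^S$ indexed by the initial state of the Markov
process. Hence the above inequality is a vector inequality and is to be interpreted componentwise.
More precisely, if $A$ is the action space, let $\Delta(A)$ denote the space of probability measures
supported on $A$. Then the function $\mathbf{v}_\beta$ is a function of the form:

\[ \mathbf{v}_\beta : \Delta(A)^S \times \Delta(A)^S \to \mathbb{R}^S, \]
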
and equilibrium strategies correspond to the saddle-points of this function. The mixed strategies 
of the players are indexed by the state (i.e. there is one probability measure per state per player). 
These probability measures (conditioned upon the state) are independent across states, and are 
also independent across the players.

C. Existence of Equilibria

In his original paper, Shapley [4] showed that stationary equilibria always exist (and that
the corresponding value vectors are unique) for two-player, zero-sum, finite state, finite action
stochastic games. (Shapley considered games where at each state there was some probability of
termination, whereas in this paper we consider games over an infinite horizon with discounted
rewards, as already mentioned. These two formulations are equivalent in the sense that starting
from a discounted game one can construct a game with termination probabilities and vice
versa such that both have the same equilibrium value vectors.) In this subsection we address
the existence and uniqueness issue, and prove that for two-player, zero-sum stochastic games
over finite state spaces, infinite strategy spaces, and polynomial payoffs, stationary equilibria
always exist, and that the value vectors are unique. Throughout the paper, we assume that the
transition probabilities are polynomial functions of the actions of the players. It is important to
note that the results of this subsection do not depend upon the single-controller assumption. As
a by-product of this proof, we obtain a simple algorithm for computing equilibria for all such
games. This algorithm is analogous to policy-iteration in dynamic programming, and consists of
solving a sequence of simple (non-stochastic) games whose value vectors converge to the true
value vector.

Let $p(x, y)$ be a polynomial, and $A = [0, 1]$ be the strategy space of players 1 and 2. Let
$\mathrm{val}(p(x, y))$ be the value of the zero-sum polynomial game with payoff function $p(x, y)$
and strategy space $A$. It can be shown that a mixed-strategy Nash equilibrium always exists
for two-player zero-sum polynomial games [8], and that such equilibria can be computed using semidefinite
programming [6].

Lemma 1: Let $p_1(x, y)$ and $p_2(x, y)$ be given polynomials. Then

\[ |\mathrm{val}(p_1(x, y)) - \mathrm{val}(p_2(x, y))| \le \max_{x, y \in [0, 1]} |p_1(x, y) - p_2(x, y)|. \]

Proof: Let $\mu_1, \nu_1$ be the optimal strategies for the polynomial zero-sum game with payoff
$p_1(x, y)$ (so that $\mathbb{E}_{\mu_1, \nu_1}[p_1(x, y)] = \mathrm{val}(p_1(x, y))$) and $\mu_2, \nu_2$ be the optimal strategies for the
game with payoff $p_2(x, y)$. If $\mathrm{val}(p_1) = \mathrm{val}(p_2)$ the result is trivial, so without loss of generality,
assume that $\mathrm{val}(p_1) > \mathrm{val}(p_2)$. By the saddle point property,

\[ \int p_1(x, y) \, d\mu_1 d\nu_2 \ge \int p_1(x, y) \, d\mu_1 d\nu_1 > \int p_2(x, y) \, d\mu_2 d\nu_2 \ge \int p_2(x, y) \, d\mu_1 d\nu_2. \]

Here the first inequality follows by considering $\nu_2$ to be a deviation of player 2 from his optimal
strategy (i.e. $\nu_1$) for the game with payoff $p_1$, the second inequality follows by the preceding
assumption, and the third inequality follows from a deviation argument for player 1 from his
optimal strategy. Hence,

\[ \left| \int p_1(x, y) \, d\mu_1 d\nu_1 - \int p_2(x, y) \, d\mu_2 d\nu_2 \right| \le \left| \int (p_1(x, y) - p_2(x, y)) \, d\mu_1 d\nu_2 \right| \le \max_{x, y \in [0, 1]} |p_1(x, y) - p_2(x, y)| \int d\mu_1 d\nu_2, \]

and the last integral equals 1 because $\mu_1$ and $\nu_2$ are probability measures.

■

Note that the quantity on the right is bounded because we are considering the maximum of a
bounded continuous function on a compact set. Let $\alpha \in \mathbb{R}^S$. Given a polynomial game with
payoff functions $r(s, a_1, a_2)$ and transition probabilities $p(s'; s, a_1, a_2)$ (sometimes we will hide
the state indices and write the entire matrix as $P(a_1, a_2)$), fix a state $s$ and define the polynomial
$G^s(\alpha) = r(s, a_1, a_2) + \beta \sum_{t \in \mathcal{S}} p(t; s, a_1, a_2) \alpha_t$. We will need to perform iterations using this
vector $\alpha \in \mathbb{R}^S$. We call the iterates of these vectors $\alpha^k \in \mathbb{R}^S$ ($k$ is the iteration index), and
denote the $s$-th component of this vector by $\alpha_s^k$. Pick the vector $\alpha^0 \in \mathbb{R}^S$ arbitrarily and define the
recursion for the $s$-th component at iteration $k$ by:

\[ \alpha_s^k = \mathrm{val}(G^s(\alpha^{k-1})), \quad k = 1, 2, \ldots \]

Rephrasing the above in terms of operators, define $T_s$ to be the operator such that

\[ T_s \alpha = \mathrm{val}(G^s(\alpha)). \]

Let $T\alpha = [T_1 \alpha, \ldots, T_S \alpha]^T$. Then the recursion simply consists of computing the terms $T^k(\alpha)$.

Lemma 2: The quantity

\[ \lim_{k \to \infty} T^k(\alpha) = \phi \]

exists and is independent of $\alpha$. Moreover, $\phi$ is the unique fixed point solution to the equation:

\[ \phi = T\phi. \]

Proof: For $\alpha \in \mathbb{R}^S$ define the norm $\|\alpha\| = \max_s |\alpha_s|$. Then,

\[ \begin{aligned}
\|T\gamma - T\alpha\| &= \max_s |\mathrm{val}(G^s(\gamma)) - \mathrm{val}(G^s(\alpha))| \\
&\le \max_s \max_{a_1, a_2 \in [0, 1]} \Big| \beta \sum_t p(t; s, a_1, a_2)(\gamma_t - \alpha_t) \Big| \quad \text{(using Lemma 1)} \\
&\le \max_s \max_{a_1, a_2 \in [0, 1]} \Big| \beta \sum_t p(t; s, a_1, a_2) \Big| \max_t |\gamma_t - \alpha_t| \\
&= \beta \|\gamma - \alpha\|.
\end{aligned} \]

Since the discount factor $\beta < 1$, we have a contraction, and by the contraction mapping principle,
the iteration $T^k \alpha$ converges to the unique fixed point of the equation $T\phi = \phi$. ■

Lemma 2 establishes that a fixed point solution to the iteration exists. We now show that the
fixed point is in fact the value vector of the game. To show this, we show that if we compute
the optimal strategies to the games $G^s(\phi)$, $s = 1, 2, \ldots, S$, then play according to
these strategies achieves the value vector $\phi$. Since these strategies will be shown to satisfy the saddle point
inequality (1), an equilibrium solution exists. To show that the value vector is unique, we show
that any value vector satisfies the fixed point equation $T\mathbf{v}_\beta = \mathbf{v}_\beta$. Since there is a unique fixed
point by Lemma 2, the value vector must be unique.

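Theorem 1: Let $\phi$ be the fixed point defined in Lemma 2. Then,

a. Let $\mu(s)$, $\nu(s)$ denote the optimal measures to the polynomial game with payoff $G^s(\phi)$, $s \in
\{1, \ldots, S\}$. Then $\boldsymbol{\mu} = [\mu(1), \ldots, \mu(S)]^T$, $\boldsymbol{\nu} = [\nu(1), \ldots, \nu(S)]^T$ are the optimal strategies
for the stochastic game.

b. If $\mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu})$ is a value vector for the game then $\mathbf{v}_\beta$ satisfies $T\mathbf{v}_\beta = \mathbf{v}_\beta$. Hence $\mathbf{v}_\beta = \phi$ exists
and is unique.

Proof: Let $\mu(s)$ and $\nu(s)$ be the optimal strategies for the game $G^s(\phi)$. Then by definition,
the expected value of play under these strategies will be $\phi_s = (T\phi)_s = \cdots = (T^k\phi)_s$. Vectorizing
this equation, we note that

\[ \phi = T^k\phi = \mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\nu}}\left[ r(a_1, a_2) + \beta P(a_1, a_2) r(a_1, a_2) + \cdots + \beta^{k-1} P^{k-1}(a_1, a_2) r(a_1, a_2) + \beta^k P^k(a_1, a_2) \phi \right]. \]

Taking the limit as $k \to \infty$, we obtain that $\phi = \mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\nu}}\left[ \sum_{k=0}^{\infty} \beta^k P^k(a_1, a_2) r(a_1, a_2) \right] = \mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu})$.
Hence playing according to the stationary strategies $\mu(s)$, $\nu(s)$, $s = 1, \ldots, S$ achieves the value
vector $\phi$. Suppose player 1 plays according to the strategy $\boldsymbol{\mu}$, and suppose player 2 deviates from
the prescribed stationary strategy $\boldsymbol{\nu}$ to a stationary strategy $\boldsymbol{\nu}'$. Then, since $\mu(s)$, $\nu(s)$ are defined to be
equilibrium strategies for the games $G^s(\phi)$, we have the (vector) inequality for all $\boldsymbol{\nu}'$:

\[ \begin{aligned}
\phi &= \mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\nu}}[r(a_1, a_2) + \beta P(a_1, a_2) \phi] \\
&\le \mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\nu}'}[r(a_1, a_2) + \beta P(a_1, a_2) \phi] \\
&\le \mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\nu}'}[r(a_1, a_2) + \beta P(a_1, a_2) r(a_1, a_2) + \beta^2 P^2(a_1, a_2) \phi] \\
&\le \mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\nu}'}[r(a_1, a_2) + \beta P(a_1, a_2) r(a_1, a_2) + \cdots + \beta^{k-1} P^{k-1}(a_1, a_2) r(a_1, a_2) + \beta^k P^k(a_1, a_2) \phi].
\end{aligned} \]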



June 15, 2008 



DRAFT 



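In the first inequality a $\phi$ occurs on the right side. We substitute that inequality for the $\phi$ on the
right side to obtain the second inequality and so on. Finally, we obtain the inequality:

\[ \mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\nu}}\left[ \sum_{k=0}^{\infty} \beta^k P^k(a_1, a_2) r(a_1, a_2) \right] \le \mathbb{E}_{\boldsymbol{\mu}, \boldsymbol{\nu}'}\left[ \sum_{k=0}^{\infty} \beta^k P^k(a_1, a_2) r(a_1, a_2) \right], \]

i.e. that $\phi = \mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu}) \le \mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu}')$ for all $\boldsymbol{\nu}'$. A similar argument for deviations of player 1
shows that $\mathbf{v}_\beta(\boldsymbol{\mu}', \boldsymbol{\nu}) \le \mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu}) = \phi$. Hence $\mu(s)$, $\nu(s)$ constructed as the strategies for the games
$G^s(\phi)$ satisfy the saddle point inequality (1) component-wise. This establishes the existence of
equilibria. For uniqueness, note that for any strategies $\boldsymbol{\mu}, \boldsymbol{\nu}$ such that $\mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu})$ satisfies the saddle
point inequality (1), by definition we have $T\mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu}) = \mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu})$. Since $T$ has a unique fixed
point, the vector $\mathbf{v}_\beta(\boldsymbol{\mu}, \boldsymbol{\nu})$ must be unique. ■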
It is interesting to note that the above proof also provides an algorithm to compute approximate
equilibria. To compute each iterate $T_s(\alpha)$ one needs to solve a polynomial game in normal form
(which can be done by solving a single semidefinite program), and by solving a sequence of such
problems, one can compute $T^k(\alpha)$ which is provably close to the actual value vector. However,
the rate of convergence of this iteration is not very attractive. In the rest of this paper, we focus
attention on single-controller games, for which equilibria can be computed by solving a single
semidefinite program.
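The structure of this iteration can be sketched on a finite stand-in. In the Python sketch below each state's auxiliary game is a 2x2 matrix game whose value is available in closed form, which replaces the semidefinite program that a polynomial game would require. This substitution, the game data, and the function names are assumptions made for illustration; the sketch demonstrates the contraction iteration $\alpha \mapsto T\alpha$ of Lemma 2, not the polynomial case itself.

```python
def val_2x2(M):
    """Value of the zero-sum 2x2 matrix game M (row player maximizes)."""
    (a, b), (c, d) = M
    lower = max(min(a, b), min(c, d))   # pure maxmin
    upper = min(max(a, c), max(b, d))   # pure minmax
    if lower == upper:                  # pure saddle point
        return lower
    return (a * d - b * c) / (a + d - b - c)  # fully mixed equilibrium

def shapley_operator(alpha, r, p, beta):
    """One application of T: solve one auxiliary matrix game per state."""
    n = len(alpha)
    out = []
    for s in range(n):
        G = [[r[s][i][j] + beta * sum(p[s][i][j][t] * alpha[t] for t in range(n))
              for j in range(2)] for i in range(2)]
        out.append(val_2x2(G))
    return out

def iterate(alpha, r, p, beta, tol=1e-12, max_iter=10_000):
    """Apply T repeatedly until successive iterates agree to within tol."""
    for _ in range(max_iter):
        nxt = shapley_operator(alpha, r, p, beta)
        if max(abs(x - y) for x, y in zip(nxt, alpha)) < tol:
            return nxt
        alpha = nxt
    return alpha

# Hypothetical two-state, two-action game.
# r[s][i][j]: payoff; p[s][i][j][t]: probability of moving to state t.
beta = 0.5
r = [[[1, -1], [-1, 1]],
     [[0, 2], [1, 0]]]
p = [[[[0.5, 0.5], [1.0, 0.0]], [[0.2, 0.8], [0.5, 0.5]]],
     [[[0.0, 1.0], [0.3, 0.7]], [[0.6, 0.4], [1.0, 0.0]]]]

phi_a = iterate([0.0, 0.0], r, p, beta)
phi_b = iterate([10.0, -10.0], r, p, beta)  # a different starting point
```

Both starting points lead to the same limit, as Lemma 2 predicts, and each step shrinks the error by at most the factor $\beta$, which is the source of the slow convergence noted above when $\beta$ is close to 1.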



D. SDP Characterization of Nonnegativity and Moments 

Let $A$ be a closed interval on the real line. The set of univariate polynomials which are
nonnegative on $A$ has an exact semidefinite description. The set of (finite) vectors in $\mathbb{R}^n$ which
correspond to moment sequences of measures supported on $A$ also has an exact semidefinite
description. We briefly review these notions here and introduce some related notation [6].

Let $\mathbb{R}[x]$ denote the set of univariate polynomials with real coefficients. Let $p(x) = \sum_{k=0}^{n} p_k x^k \in
\mathbb{R}[x]$. We say that $p(x)$ is nonnegative on $A$ if $p(x) \ge 0$ for every $x \in A$. We denote the
set of polynomials of degree $n$ which are nonnegative on $A$ by $\mathcal{P}(A)$. (To avoid
cumbersome notation, we exclude the degree information in the notation. Moreover the degree
will usually be clear from the context.) The polynomial $p(x)$ is said to be a sum of squares if
there exist polynomials $q_1(x), \ldots, q_k(x)$ such that $p(x) = \sum_{i=1}^{k} q_i(x)^2$. It is well known that a
univariate polynomial $p(x)$ is a sum of squares if and only if $p(x) \in \mathcal{P}(\mathbb{R})$.

Let $\mu$ denote a measure supported on the set $A$. The $i$-th moment of the measure $\mu$ is denoted
by

\[ \mu_i = \int_A x^i \, d\mu. \]

Let $\bar{\mu} = [\mu_0, \ldots, \mu_n]$ be a vector in $\mathbb{R}^{n+1}$. We say that $\bar{\mu}$ is a moment sequence of length $n + 1$
if it corresponds to the first $n + 1$ moments of some nonnegative measure $\mu$ supported on the set
$A$. The moment space, denoted by $\mathcal{M}(A)$, is the subset of $\mathbb{R}^{n+1}$ which corresponds to moments
of nonnegative measures supported on the set $A$. We say that a nonnegative measure $\mu$ is a
probability measure if its zeroth order moment satisfies $\mu_0 = 1$. The set of moment sequences
of length $n + 1$ corresponding to probability measures is denoted by $\mathcal{M}_P(A)$.

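Let $\mathbb{S}^n$ denote the set of $n \times n$ symmetric matrices and define the linear operator $\mathcal{H} : \mathbb{R}^{2n-1} \to
\mathbb{S}^n$ as:

\[ \mathcal{H} : \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_{2n-1} \end{bmatrix} \mapsto \begin{bmatrix} a_1 & a_2 & \cdots & a_n \\ a_2 & a_3 & \cdots & a_{n+1} \\ \vdots & \vdots & \ddots & \vdots \\ a_n & a_{n+1} & \cdots & a_{2n-1} \end{bmatrix}. \]

Thus $\mathcal{H}$ is simply the linear operator that takes a vector and constructs the associated Hankel
matrix which is constant along the antidiagonals. We will also frequently use the adjoint of this
operator, the linear map $\mathcal{H}^* : \mathbb{S}^n \to \mathbb{R}^{2n-1}$:

\[ \mathcal{H}^* : \begin{bmatrix} m_{11} & m_{12} & \cdots & m_{1n} \\ m_{12} & m_{22} & \cdots & m_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ m_{1n} & m_{2n} & \cdots & m_{nn} \end{bmatrix} \mapsto \begin{bmatrix} m_{11} \\ 2 m_{12} \\ m_{22} + 2 m_{13} \\ \vdots \\ m_{nn} \end{bmatrix}. \]

This map flattens a matrix into a vector by adding all the entries along antidiagonals.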

Lemma 3: Let $p(x) = \sum_{k=0}^{2n} p_k x^k$ be a polynomial. Let $p = [p_0, \ldots, p_{2n}]^T$ be the vector of
its coefficients. Then $p(x)$ is nonnegative (or SOS) if and only if there exists $S \in \mathbb{S}^{n+1}$, $S \succeq 0$
such that:

\[ p = \mathcal{H}^*(S). \]

Proof: For univariate polynomials, nonnegativity is equivalent to SOS (see [9]). Let $[x]_n =
[1, x, \ldots, x^n]^T$. We have for every $S \in \mathbb{S}^{n+1}$,

\[ \mathcal{H}^*(S)^T [x]_{2n} = [x]_n^T \, S \, [x]_n. \]

Factoring $S \succeq 0$, we obtain a sum of squares decomposition. The converse is immediate. ■
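The operators $\mathcal{H}$ and $\mathcal{H}^*$ are easy to write down explicitly. The following Python sketch (with zero-based indexing and illustrative function names, which are assumptions of this example) implements both maps, checks the adjoint identity $\langle \mathcal{H}(a), M \rangle = \langle a, \mathcal{H}^*(M) \rangle$ on a small instance, and recovers the coefficients of $(1 - x)^2$ from the rank-one Gram matrix $S = q q^T$ with $q = [1, -1]^T$.

```python
def hankel(a):
    """H: vector of length 2n-1 -> n x n Hankel matrix with entries a[i+j]."""
    n = (len(a) + 1) // 2
    return [[a[i + j] for j in range(n)] for i in range(n)]

def hankel_adjoint(M):
    """H*: n x n matrix -> vector of antidiagonal sums (length 2n-1)."""
    n = len(M)
    out = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            out[i + j] += M[i][j]
    return out

def matrix_inner(X, Y):
    """Trace inner product <X, Y> = sum_ij X_ij Y_ij."""
    return sum(x * y for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

def vector_inner(u, v):
    """Euclidean inner product of two vectors."""
    return sum(x * y for x, y in zip(u, v))

# Adjoint identity on a small example.
a = [1, 2, 3]
M = [[2, 1],
     [1, 5]]
lhs = matrix_inner(hankel(a), M)
rhs = vector_inner(a, hankel_adjoint(M))

# Gram matrix of q(x) = 1 - x; it is positive semidefinite (rank one).
q = [1, -1]
S = [[qi * qj for qj in q] for qi in q]
coeffs = hankel_adjoint(S)   # coefficients of q(x)^2 = 1 - 2x + x^2
```

In an actual semidefinite program the matrix $S$ would be a decision variable constrained to be positive semidefinite, and `coeffs` would be matched to the coefficients of the polynomial being certified.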
One can give a similar semidefinite characterization of polynomials that are nonnegative on an
interval. Since in this paper we are typically considering the interval to be $[0, 1]$ we give an
explicit semidefinite characterization of $\mathcal{P}([0, 1])$. We define the following matrices:

\[ L_1 = \begin{bmatrix} I_{n \times n} \\ 0_{1 \times n} \end{bmatrix}, \qquad L_2 = \begin{bmatrix} 0_{1 \times n} \\ I_{n \times n} \end{bmatrix}, \]

where $I_{n \times n}$ stands for the $n \times n$ identity matrix.

Lemma 4: The polynomial $p(x) = \sum_{k=0}^{2n} p_k x^k$ is nonnegative on $[0, 1]$ if and only if there
exist matrices $Z \in \mathbb{S}^{n+1}$ and $W \in \mathbb{S}^{n}$, $Z \succeq 0$, $W \succeq 0$ such that

\[ \begin{bmatrix} p_0 \\ \vdots \\ p_{2n} \end{bmatrix} = \mathcal{H}^*\left( Z + \tfrac{1}{2}\left( L_1 W L_2^T + L_2 W L_1^T \right) - L_2 W L_2^T \right). \]

Proof: The proof follows from the characterization of nonnegative polynomials on intervals.
It is well known that

\[ p(x) \ge 0 \quad \forall x \in [0, 1] \iff p(x) = z(x) + x(1 - x) w(x), \]

where $z(x)$ and $w(x)$ are sums of squares. A simple application of Lemma 3 yields the required
condition. ■
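As a worked instance of Lemma 4, take $n = 1$ and $p(x) = x$, which is nonnegative on $[0, 1]$ but not on the whole real line. One admissible decomposition is $x = x^2 + x(1 - x) \cdot 1$, that is, $z(x) = x^2$ and $w(x) = 1$. The Python sketch below assembles the corresponding matrices $Z$ and $W$ and confirms that the formula of the lemma returns the coefficient vector $[0, 1, 0]$. The helper functions and the particular choice of certificate are assumptions made for illustration.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(c, A):
    return [[c * x for x in row] for row in A]

def hankel_adjoint(M):
    """H*: sum the entries of M along each antidiagonal."""
    n = len(M)
    out = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            out[i + j] += M[i][j]
    return out

n = 1
# L1 stacks the identity over a zero row; L2 stacks a zero row over the identity.
L1 = [[1 if i == j else 0 for j in range(n)] for i in range(n + 1)]
L2 = [[1 if i == j + 1 else 0 for j in range(n)] for i in range(n + 1)]

Z = [[0, 0],
     [0, 1]]   # Gram matrix of z(x) = x^2 in the basis [1, x]; PSD
W = [[1]]      # Gram matrix of w(x) = 1; PSD

cross = add(matmul(matmul(L1, W), transpose(L2)),
            matmul(matmul(L2, W), transpose(L1)))
square = matmul(matmul(L2, W), transpose(L2))
G = add(add(Z, scale(F(1, 2), cross)), scale(-1, square))

coeffs = hankel_adjoint(G)   # coefficients [p0, p1, p2] of p(x)
```

Note that the combined matrix `G` need not be positive semidefinite itself; only $Z$ and $W$ are required to be.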
In this paper, we will also be using a very important classical result about the semidefinite
representation of moment spaces [10], [11]. We give an explicit characterization of $\mathcal{M}([0, 1])$
and $\mathcal{M}_P([0, 1])$.

Lemma 5: The vector $\bar{\mu} = [\mu_0, \mu_1, \ldots, \mu_{2n}]^T$ is a valid set of moments for a nonnegative
measure supported on $[0, 1]$ if and only if

\[ \mathcal{H}(\bar{\mu}) \succeq 0, \qquad \tfrac{1}{2}\left( L_1^T \mathcal{H}(\bar{\mu}) L_2 + L_2^T \mathcal{H}(\bar{\mu}) L_1 \right) - L_2^T \mathcal{H}(\bar{\mu}) L_2 \succeq 0. \qquad (2) \]

Moreover, it is a moment sequence corresponding to a probability measure if and only if in
addition to (2) it satisfies $\mu_0 = 1$.

Proof: The proof follows by dualizing Lemma 4. Alternatively, a direct proof may be found
in the literature on moment spaces [10], [11]. ■

For example, for $2n = 2$ the sequence $[\mu_0, \mu_1, \mu_2]$ is a moment sequence corresponding to a
measure supported on $[0, 1]$ if and only if the following inequalities are true:

\[ \begin{bmatrix} \mu_0 & \mu_1 \\ \mu_1 & \mu_2 \end{bmatrix} \succeq 0, \qquad \mu_1 - \mu_2 \ge 0. \]
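For the case $2n = 2$ the two conditions can be tested directly, since a symmetric $2 \times 2$ matrix is positive semidefinite exactly when its diagonal entries and its determinant are nonnegative. The Python sketch below applies this test to a few candidate sequences; the function name and the test sequences are illustrative choices.

```python
from fractions import Fraction as F

def is_moment_sequence_01(m0, m1, m2):
    """Lemma 5 for 2n = 2: Hankel matrix PSD and m1 - m2 >= 0."""
    hankel_psd = m0 >= 0 and m2 >= 0 and m0 * m2 - m1 * m1 >= 0
    localizing = m1 - m2 >= 0
    return hankel_psd and localizing

# Uniform measure on [0, 1]: moments 1, 1/2, 1/3.
uniform_ok = is_moment_sequence_01(F(1), F(1, 2), F(1, 3))

# Point mass at 1/2: moments 1, 1/2, 1/4 (Hankel determinant is zero).
dirac_ok = is_moment_sequence_01(F(1), F(1, 2), F(1, 4))

# Second moment larger than the first: impossible on [0, 1], where x^2 <= x.
too_large = is_moment_sequence_01(F(1), F(1, 2), F(3, 5))

# Second moment below the squared mean: would imply negative variance.
too_small = is_moment_sequence_01(F(1), F(1, 2), F(1, 5))
```

The first condition encodes nonnegative variance, and the second encodes the fact that $x(1 - x) \ge 0$ on the support.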

III. Finite Strategy Case 

For the reader's convenience and comparison purposes, we briefly review here the case where
each player has only finitely many strategies at each state $s$. Again, for simplicity we assume
that the set of pure strategies available to each player at each state is identical so that $A_1 =
A_2 = \{1, \ldots, m\}$. In the finite strategy case, when assumption SC holds, a minimax solution
may be computed via linear programming. We state the linear program in this section. In the
next section, drawing motivation from this linear program, we write an infinite dimensional
optimization problem for the case where each player has a choice from infinitely many pure
strategies. The finite action game is completely defined via the specification of the following
data:

1) The state space $\mathcal{S} = \{1, \ldots, S\}$.

2) The (finite) sets of actions for players 1 and 2 given by $A_1 = A_2 = \{1, \ldots, m\}$.

3) The payoff function for a given state $s$ (representable by a matrix indexed by the actions
of each player) denoted by $r(s, a_1, a_2)$.

4) The probability transition matrix $p(s'; s, a_1)$ which provides the conditional probability of
transition from state $s$ to $s'$ given player 1's action $a_1$.

5) The discount factor $\beta$.

A mixed strategy for player 1 is a function $f : \mathcal{S} \times A_1 \to [0, 1]$ subject to the normalization
constraint $\sum_{a_1 \in A_1} f(s, a_1) = 1$ for all $s \in \mathcal{S}$ (so that $f(s) = [f(s, 1), \ldots, f(s, m)]$ becomes a
probability distribution over the strategy space $A_1$). Similarly the mixed strategy for player 2 in
a particular state $s$ is given by $g(s) = [g(s, 1), \ldots, g(s, m)]$. The collection of mixed strategies
(indexed by the states) will be denoted by $\mathbf{f} = [f(1), \ldots, f(S)]$ (and $\mathbf{g} = [g(1), \ldots, g(S)]$
respectively). A strategy $\mathbf{f}$ leads to a probability transition matrix $P(\mathbf{f})$ with entries $P_{ss'}(\mathbf{f}) = \sum_{a_1 \in A_1} p(s'; s, a_1) f(s, a_1)$.
Again we consider a $\beta$-discounted process over an infinite horizon. Given strategies $\mathbf{f}$ and $\mathbf{g}$,
the reward collected by player 1 in some stage $s$ is given by:

\[ r(s, f(s), g(s)) = \sum_{a_1 \in A_1, \, a_2 \in A_2} r(s, a_1, a_2) f(s, a_1) g(s, a_2). \]

The reward collected over the infinite horizon starting at state $s$, $v_\beta(s, f(s), g(s))$, is given by
the system of equations:

\[ v_\beta(s, f(s), g(s)) = r(s, f(s), g(s)) + \beta \sum_{s' \in \mathcal{S}} \left( \sum_{a_1 \in A_1} p(s'; s, a_1) f(s, a_1) \right) v_\beta(s', f(s'), g(s')). \]

Thus,

\[ \mathbf{v}_\beta(\mathbf{f}, \mathbf{g}) = (I - \beta P(\mathbf{f}))^{-1} \, \mathbf{r}(\mathbf{f}, \mathbf{g}), \]

where $\mathbf{r}(\mathbf{f}, \mathbf{g}) = [r(1, f(1), g(1)), \ldots, r(S, f(S), g(S))] \in \mathbb{R}^S$. The problem is to find equilibrium
strategies $\mathbf{f}^0$ and $\mathbf{g}^0$ that satisfy the Nash equilibrium property:

\[ \mathbf{v}_\beta(\mathbf{f}, \mathbf{g}^0) \le \mathbf{v}_\beta(\mathbf{f}^0, \mathbf{g}^0) \le \mathbf{v}_\beta(\mathbf{f}^0, \mathbf{g}) \qquad (3) \]

for all mixed strategies $\mathbf{f}, \mathbf{g}$.

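Theorem 2 ([3]): Consider the primal-dual pair of linear programs:

\[ \begin{aligned}
\underset{g(s, a_2), \, v(s)}{\text{minimize}} \quad & \sum_{s=1}^{S} v(s) \\
\text{subject to} \quad & v(s) \ge \sum_{a_2 \in A_2} r(s, a_1, a_2) g(s, a_2) + \beta \sum_{s'=1}^{S} p(s'; s, a_1) v(s') \quad \forall s \in \mathcal{S}, \ a_1 \in A_1 \\
& \sum_{a_2 \in A_2} g(s, a_2) = 1 \quad \forall s \in \mathcal{S} \\
& g(s, a_2) \ge 0 \quad \forall s \in \mathcal{S}, \ a_2 \in A_2
\end{aligned} \qquad (P) \]

and

\[ \begin{aligned}
\underset{x(s, a_1), \, z(s)}{\text{maximize}} \quad & \sum_{s=1}^{S} z(s) \\
\text{subject to} \quad & \sum_{s=1}^{S} \sum_{a_1 \in A_1} \left[ \delta(s, s') - \beta p(s'; s, a_1) \right] x(s, a_1) = 1 \quad \forall s' \in \mathcal{S} \\
& z(s) \le \sum_{a_1 \in A_1} x(s, a_1) r(s, a_1, a_2) \quad \forall s \in \mathcal{S}, \ a_2 \in A_2 \\
& x(s, a_1) \ge 0 \quad \forall s \in \mathcal{S}, \ a_1 \in A_1.
\end{aligned} \qquad (D) \]

Let $p^*$ be the optimal value of (P), and $d^*$ be the optimal value of (D). Let $x^*(s, a_1)$ be the
optimal values of the $x(s, a_1)$ variables obtained in (D). Let

\[ f^*(s, a_1) = \frac{x^*(s, a_1)}{\sum_{a_1 \in A_1} x^*(s, a_1)}, \]

and $g^*(s, a_2)$ be the distribution obtained by the optimal solution of (P). Then the following
statements hold:

1) $p^* = d^*$.

2) Let $\mathbf{v}^* = [v^*(1), \ldots, v^*(S)]$ be the optimal solution of (P). Then $\mathbf{v}^* = \mathbf{v}_\beta(\mathbf{f}^*, \mathbf{g}^*)$.

3) $\mathbf{v}_\beta(\mathbf{f}^*, \mathbf{g}^*)$ satisfies the saddle-point inequality (3).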

Remark Note that statement 2 claims that the solution of the LP (P) corresponds to the infinite 
horizon discounted reward obtained when players 1 and 2 play according to the distributions f* 
and g*. Statement 3 claims that these distributions are in fact optimal for the two players in the 
Nash equilibrium sense. 

Proof: See [3, pp. 93]. ■

Remark Note that the primal problem (P) has a natural interpretation in terms of security
strategies. Feasible vectors $\mathbf{v}$ and $\mathbf{g}$ satisfy the first set of inequalities in (P). The inequalities
can be interpreted to mean that, using strategy $\mathbf{g}$, the payoff that player 2 concedes to player 1
will be at most $\mathbf{v}$, whatever action player 1 chooses.
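
The security-strategy reading of (P) can be made concrete with a small feasibility check. This is a sketch under assumptions: the nested-list layout `r[s][a1][a2]`, `p[s][a1][s2]`, the function name, and the one-state test game are all hypothetical, and `g[s]` is treated as a probability distribution (nonnegative entries summing to one).

```python
def satisfies_primal_constraints(v, g, r, p, beta, tol=1e-9):
    """Check the constraints of (P) for a candidate pair (v, g)."""
    n_states = len(v)
    for s in range(n_states):
        # g[s] must be a probability distribution over player 2's actions.
        if any(x < -tol for x in g[s]) or abs(sum(g[s]) - 1.0) > tol:
            return False
        for a1 in range(len(r[s])):
            stage = sum(r[s][a1][a2] * g[s][a2] for a2 in range(len(g[s])))
            future = beta * sum(p[s][a1][s2] * v[s2] for s2 in range(n_states))
            # v(s) must dominate what player 1 can secure with action a1.
            if v[s] < stage + future - tol:
                return False
    return True


# Hypothetical one-state game with stage payoff matrix [[2, 0], [0, 1]].
r = [[[2.0, 0.0], [0.0, 1.0]]]
p = [[[1.0], [1.0]]]
g = [[1.0 / 3.0, 2.0 / 3.0]]
feasible = satisfies_primal_constraints([4.0 / 3.0], g, r, p, beta=0.5)
infeasible = satisfies_primal_constraints([1.0], g, r, p, beta=0.5)
```

With this `g`, player 1 earns 2/3 per stage whichever action he picks, so the smallest feasible value is $(2/3)/(1 - 0.5) = 4/3$; any smaller `v` violates the inequality.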
IV. Infinite Strategy Case

A. Problem Setup

In this section we consider the case where each player can choose from uncountably many
different actions. In particular, each player can choose actions from the set [0, 1]. The number
of states $|S| = S$ is still finite. The payoff function $r(s, a_1, a_2)$ is a polynomial in $a_1$ and $a_2$ for
each $s \in S$. The single controller case (Assumption SC) is studied. In this case, we assume that
the probability of transition $p(s'; s, a_1)$ is a polynomial in $a_1$. Again we consider the two-player
zero sum case where player 1 attempts to maximize his reward over the infinite horizon. We
generalize the problem (P) to this case. The variables $\mathbf{f}$ and $\mathbf{g}$ representing distributions over
the finite sets $A_1$ and $A_2$ are replaced by measures $\mu(s)$ and $\nu(s)$. These measures represent
mixed strategies over the uncountable action spaces. (We remind the reader that for each player
there are $S$ measures, each measure corresponding to a mixed strategy in a particular state. For
example, $\mu(s)$ corresponds to the mixed strategy player 1 would adopt when the game is in state
$s$.)

B. Preliminary Results 

We point out that the generalization of (P) to this case is an optimization problem involving 
non-negativity of a system of univariate polynomials with coefficients that depend on the mo- 
ments of these measures. The interpretation in terms of security strategies for player 2 holds. 
The following is the generalization of the linear program (P) mentioned above: 

minimize $\sum_{s=1}^{S} v(s)$ over $\nu(s),\ v(s)$

(a) $v(s) \ge \int_{A_2} r(s, a_1, a_2)\, d\nu(s) + \beta \sum_{s'=1}^{S} p(s'; s, a_1)\, v(s')$ for all $s \in S$, $a_1 \in A_1$

(b) $\nu(s)$ is a measure supported on $A_2$ for all $s \in S$

Since $\int_{A_2} r(s, a_1, a_2)\, d\nu(s) = q_\nu(s, a_1)$, a univariate polynomial in $a_1$ for each $s \in S$, for a
fixed vector $v(s)$ the constraints (a) are a system of polynomial inequalities. Note that the
coefficients of $q_\nu$ depend on the measure $\nu$ only via finitely many moments. More concretely,
let $r(s, a_1, a_2) = \sum_{i,j} r_{ij}(s)\, a_1^i a_2^j$ be the payoff polynomial. Then
$\int_{A_2} r(s, a_1, a_2)\, d\nu(s) = \sum_{i,j} r_{ij}(s)\, a_1^i\, \nu_j(s)$, where $\nu_j(s) = \int_{A_2} a_2^j\, d\nu(s)$ is the $j$-th moment
of $\nu(s)$. Using this observation, this problem may be rewritten as the following problem.

minimize $\sum_{s=1}^{S} v(s)$

(c) $v(s) - \sum_{i,j} r_{ij}(s)\, a_1^i\, \nu_j(s) - \beta \sum_{s'=1}^{S} p(s'; s, a_1)\, v(s') \ge 0$ for all $a_1 \in A_1$ and all $s \in S$ (P')

(d) $\nu(s) \in \mathcal{M}(A_2)$, and $\nu_0(s) = 1$ for all $s \in S$.
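
The step from the integral to a polynomial in $a_1$ whose coefficients are linear in the moments can be illustrated directly. The sketch below assumes a coefficient table `R[i][j]` holding the coefficient of $a_1^i a_2^j$ and a moment list `nu[j]`; the function name is hypothetical, and the sample data are the payoff $(a_1 - a_2)^2$ together with the moment vector $[1, 0.614, 0.377]$ quoted in the example of Section V.

```python
def integrate_out_a2(R, nu):
    """Return coefficients (lowest degree first) of q(a1) = sum_ij R[i][j] a1^i nu[j]."""
    return [sum(row[j] * nu[j] for j in range(len(row))) for row in R]


def evaluate(coeffs, a1):
    """Evaluate a polynomial given by coefficients, lowest degree first."""
    return sum(c * a1 ** k for k, c in enumerate(coeffs))


# (a1 - a2)^2 = a2^2 - 2 a1 a2 + a1^2, stored as R[i][j] for a1^i a2^j.
R = [[0.0, 0.0, 1.0],
     [0.0, -2.0, 0.0],
     [1.0, 0.0, 0.0]]
nu = [1.0, 0.614, 0.377]
q = integrate_out_a2(R, nu)
```

The resulting coefficient list corresponds to $\nu_2 - 2\nu_1 a_1 + a_1^2$, which depends on the measure only through $\nu_1$ and $\nu_2$.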

The constraints (c) give a system of polynomial inequalities in $a_1$, one inequality per state. Fix
some state $s$. Let the degree of the inequality for that state be $d_s$. Let $[a_1]_{d_s} = [1, a_1, a_1^2, \ldots, a_1^{d_s}]^T$.
The first term in constraint (c) can be rewritten in vector form as:

$$\sum_{i,j} r_{ij}(s)\, a_1^i\, \nu_j(s) = \nu(s)^T R(s)^T [a_1]_{d_s},$$

where $R(s)$ is a matrix that contains the coefficients of the polynomial $r(s, a_1, a_2)$. Similar to
the finite strategy case we define a vector by $\mathbf{v}^* = [v^*(1), \ldots, v^*(S)]^T$ which will turn out to be
the value vector of the stochastic game (which is indexed by the state). The second term in the
constraint (c), which depends on the probability transition $p(s'; s, a_1)$, is also a polynomial in $a_1$
whose coefficients depend on the coefficients of $p(s'; s, a_1)$ and $\mathbf{v}$. Specifically

$$\sum_{s'=1}^{S} p(s'; s, a_1)\, v(s') = \mathbf{v}^T Q(s)^T [a_1]_{d_s},$$

for some matrix $Q(s)$ which contains the coefficients of $p(s'; s, a_1)$.

Lemma 6: Let $A_1 = A_2 = [0, 1]$. Let $E_s \in \mathbb{R}^{(d_s+1) \times S}$ be the matrix which has a 1 in the $(1, s)$
position. Then the semidefinite program (SP) given by:

minimize $\sum_{s=1}^{S} v(s)$ over $\nu(s),\ v(s)$

(e) $\mathcal{H}^*\big(Z_s + \tfrac{1}{2}(L_1 W_s L_2^T + L_2 W_s L_1^T) - L_2 W_s L_2^T\big) = E_s \mathbf{v} - \beta Q(s) \mathbf{v} - R(s)\nu(s) \quad \forall s \in S$

(f) $\mathcal{H}(\nu(s)) \succeq 0 \quad \forall s \in S$

(g) $\tfrac{1}{2}\big(L_1^T \mathcal{H}(\nu(s)) L_2 + L_2^T \mathcal{H}(\nu(s)) L_1\big) - L_2^T \mathcal{H}(\nu(s)) L_2 \succeq 0 \quad \forall s \in S$

(h) $e_1^T \nu(s) = 1 \quad \forall s \in S$

(SP)

exactly solves the polynomial optimization problem (P').

Proof: The polynomial in inequality (c) has the coefficient vector $E_s \mathbf{v} - \beta Q(s) \mathbf{v} - R(s)\nu(s)$.
The proof follows as a direct consequence of Lemma 4 concerning the semidefinite representation
of polynomials nonnegative over [0, 1], and Lemma 5 concerning the semidefinite representation
of moment sequences of nonnegative measures supported on [0, 1]. ■
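
For the smallest nontrivial case, a moment vector of length three, the two matrix inequalities on the moments reduce to scalar tests. The sketch below assumes the standard truncated moment conditions on [0, 1] for second-order moments: the 2x2 Hankel matrix $[[m_0, m_1], [m_1, m_2]]$ is positive semidefinite, and the localizing condition for $x(1 - x)$ gives $m_1 - m_2 \ge 0$. The function name and the tolerance are illustrative choices, and the paper's Lemma 5 should be consulted for the general statement.

```python
def is_moment_vector_on_unit_interval(m, tol=1e-9):
    """Test [m0, m1, m2] against the order-2 moment conditions on [0, 1]."""
    m0, m1, m2 = m
    # 2x2 Hankel matrix [[m0, m1], [m1, m2]] is PSD iff its diagonal entries
    # and its determinant are nonnegative.
    hankel_psd = m0 >= -tol and m2 >= -tol and m0 * m2 - m1 * m1 >= -tol
    # Localizing condition for x(1 - x) >= 0 on [0, 1].
    localizing = m1 - m2 >= -tol
    return hankel_psd and localizing


checks = {
    "point mass at 0.614": is_moment_vector_on_unit_interval([1.0, 0.614, 0.377]),
    "atoms at 0 and 1": is_moment_vector_on_unit_interval([1.0, 0.614, 0.614]),
    "point mass at 0.5": is_moment_vector_on_unit_interval([1.0, 0.5, 0.25]),
    "second moment too large": is_moment_vector_on_unit_interval([1.0, 0.5, 0.6]),
    "negative variance": is_moment_vector_on_unit_interval([1.0, 0.5, 0.2]),
}
```

The first three vectors are the moment sequences reported in the example of Section V; the last two are deliberately invalid inputs that fail the localizing and Hankel tests respectively.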
The dual of (SP) is given by the following semidefinite program (SD):

maximize $\sum_{s=1}^{S} \alpha(s)$

Rj^{s) - a{s)ei Vs G 5 
(k) $\mathcal{H}(\xi(s)) \succeq 0 \quad \forall s \in S$

EsiEs - ms)ms) = I 

(m) A,R^O Vse<S. 



Lemma 7: The dual SDP (SD) is equivalent to the following polynomial optimization problem:

maximize $\sum_{s=1}^{S} \alpha(s)$

(D')

(o) $\xi(s) \in \mathcal{M}(A_2) \quad \forall s \in S$

Proof: This again follows as a consequence of Lemmas 4 and 5. ■

Remark Note that in the dual problem, the moment sequences do not necessarily correspond to 
probability measures. Hence, to convert them to probability measures, one needs to normalize 
the measure. Upon normalization, one obtains the optimal strategy for player 1. 
Lemma 8: The polynomial optimization problems (P') and (D') are strong duals of each
other.

Proof: We prove this by showing that the semidefinite program (SP) satisfies Slater's
constraint qualification and that it is bounded from below. The result then follows from the
strong duality of the equivalent semidefinite programs (SP) and (SD).

First pick $\mu(s)$ and $\nu(s)$ to be the uniform distribution on [0, 1] for each state $s \in S$. One can
show [10] that the moment sequence of $\mu$ is in the interior of the moment space of [0, 1]. As a
consequence, constraints (f) and (g) are strictly positive definite. Using the strategies $\mu$ and $\nu$,
evaluate the discounted value of this pair of strategies as:

$$\mathbf{v}_\beta = (I - \beta P(\mu))^{-1}\, \mathbf{r}(\mu, \nu).$$

Choose $\mathbf{v} > \mathbf{v}_\beta$. The polynomial inequalities given by (c) are all strictly positive and thus
constraints (i) are strictly positive definite. The equality constraints are trivially satisfied.
To prove that the problem is bounded below, we note that $r(s, a_1, a_2)$ is a polynomial and that
the strategy spaces for both players are bounded. Hence,

$$\inf_{a_1, a_2 \in [0, 1]} r(s, a_1, a_2)$$

is finite and provides a trivial lower bound for $v(s)$. ■
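Lemma 9: Let $\nu^*(s)$ and $\xi^*(s)$ be optimal moment sequences for (P') and (D') respectively.
Let $\nu^*(s)$ and $\xi^*(s)$ also denote the corresponding measures, supported on $A_2$ and $A_1$ respectively. The
following complementary slackness results hold for the optima of (P') and (D'):

$$\int_{A_1} v^*(s)\, d\xi^*(s) = \int_{A_1}\int_{A_2} r(s, a_1, a_2)\, d\xi^*(s)\, d\nu^*(s) + \beta \sum_{s'} v^*(s') \int_{A_1} p(s'; s, a_1)\, d\xi^*(s) \quad \forall s \in S \qquad (4)$$

$$\alpha^*(s) = \int_{A_2}\int_{A_1} r(s, a_1, a_2)\, d\xi^*(s)\, d\nu^*(s) \quad \forall s \in S. \qquad (5)$$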

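Proof: The result follows from the strong duality of the equivalent semidefinite representations
of the primal-dual pair (P')-(D'). The Lagrangian function for (P') is given by:

$$\mathcal{L}(\xi, \alpha) = \inf_{v, \nu} \Big\{ \sum_{s=1}^{S} v(s) - \sum_{s=1}^{S} \int_{A_1} \Big[ v(s) - \int_{A_2} r(s, a_1, a_2)\, d\nu(s) - \beta \sum_{s'=1}^{S} p(s'; s, a_1)\, v(s') \Big] d\xi(s) + \sum_{s=1}^{S} \alpha(s) \Big( 1 - \int_{A_2} d\nu(s) \Big) \Big\}.$$

$\mathcal{L}(\xi, \alpha)$ must satisfy weak duality, i.e. $d^* \le p^*$. At optimality $p^* = \sum_{s=1}^{S} v^*(s)$ for some vector
$\mathbf{v}^*$. However, strong duality holds, i.e. $p^* = d^*$. This forces the first complementary slackness
relation. The second relation is obtained similarly by considering the Lagrangian of the dual
problem. ■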
We have shown that problem (P') can be reduced to the semidefinite program (SP), and 
is thus computationally tractable via convex optimization algorithms. We next show that the 
solution to problem (P') is in fact the desired equilibrium solution. 

C. Main Theorem 

Let $p^*$ be the optimal value of (P'), and $d^*$ be the optimal value of (D'). Let $\nu^*$ and $\xi^*$
be the optimal measures recovered in (P') and (D'). Let

$$\mu^*(s) = \frac{\xi^*(s)}{\int_{A_1} d\xi^*(s)},$$

so that $\mu^*$ is a normalized version of $\xi^*$ (i.e. $\mu^*$ is a probability measure). Let $\mathbf{v}^*$ be the vector
obtained as the optimal solution of (P').

Theorem 3: The optimal solutions to the primal-dual pair (P'), (D') satisfy the following:

1) $p^* = d^*$.

2) $\mathbf{v}^* = \mathbf{v}_\beta(\mu^*, \nu^*)$.

3) $\mathbf{v}_\beta(\mu^*, \nu^*)$ satisfies the saddle-point inequality:

$$\mathbf{v}_\beta(\mu, \nu^*) \le \mathbf{v}_\beta(\mu^*, \nu^*) \le \mathbf{v}_\beta(\mu^*, \nu) \qquad (6)$$

for all mixed strategies $\mu$, $\nu$.
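Proof:

1) Follows from the strong duality of the primal-dual pair (P')-(D').

2) Using Lemma 9, equation (4) in normalized form (i.e. dividing throughout by $\xi_0^*(s)$, which
is the zeroth order moment of the measure $\xi^*(s)$) we obtain

$$v^*(s) = \int_{A_1}\int_{A_2} r(s, a_1, a_2)\, d\mu^*(s)\, d\nu^*(s) + \beta \sum_{s'=1}^{S} v^*(s') \int_{A_1} p(s'; s, a_1)\, d\mu^*(s).$$

Upon simplification and vectorization of $v^*(s)$ one obtains

$$\mathbf{v}^* = \mathbf{r}(\mu^*, \nu^*) + \beta P(\mu^*)\, \mathbf{v}^*.$$

Using a Bellman equation argument or by simply iterating this equation (i.e. substituting
repeatedly for $\mathbf{v}^*$) it is easy to see that $\mathbf{v}^* = \mathbf{v}_\beta(\mu^*, \nu^*)$.

3) Consider inequality (c) at its optimal value. We have for every state $s$:

$$v^*(s) \ge \int_{A_2} r(s, a_1, a_2)\, d\nu^*(s) + \beta \sum_{s'=1}^{S} p(s'; s, a_1)\, v^*(s').$$

Integrating with respect to some arbitrary probability measure $\mu(s)$ (with support on $A_1$),
we get:

$$v^*(s) \ge \int_{A_1}\int_{A_2} r(s, a_1, a_2)\, d\mu(s)\, d\nu^*(s) + \beta \sum_{s'=1}^{S} \int_{A_1} p(s'; s, a_1)\, v^*(s')\, d\mu(s).$$

Thus,

$$v^*(s) \ge r(s, \mu(s), \nu^*(s)) + \beta \sum_{s'=1}^{S} \int_{A_1} p(s'; s, a_1)\, v^*(s')\, d\mu(s).$$

Iterating this inequality, we obtain $\mathbf{v}_\beta(\mu^*, \nu^*) = \mathbf{v}^* \ge \mathbf{v}_\beta(\mu, \nu^*)$ for every strategy $\mu$. This
completes one side of the saddle point inequality.
Using the normalized version of equation (5), we get:

$$\frac{\alpha^*(s)}{\xi_0^*(s)} = \int_{A_1}\int_{A_2} r(s, a_1, a_2)\, d\mu^*(s)\, d\nu^*(s).$$

If we integrate inequality (n) in problem (D') with respect to any arbitrary probability
measure $\nu(s)$ with support on $A_2$ we obtain

$$\alpha^*(s) \le \int_{A_1}\int_{A_2} r(s, a_1, a_2)\, d\xi^*(s)\, d\nu(s).$$

Dividing by $\xi_0^*(s)$ and combining with the previous equality, $r(s, \mu^*(s), \nu^*(s)) \le r(s, \mu^*(s), \nu(s))$
for every $s$. Multiplying throughout by $(I - \beta P(\mu^*))^{-1}$, which is entrywise nonnegative and
therefore preserves the inequality, we get $\mathbf{v}_\beta(\mu^*, \nu^*) \le \mathbf{v}_\beta(\mu^*, \nu)$. This completes the other
side of the saddle point inequality.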
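
The two facts used in steps 2 and 3 of the proof, namely that iterating the fixed-point equation yields the discounted value and that multiplication by $(I - \beta P)^{-1}$ preserves entrywise inequalities, can be spelled out as follows. This is a sketch assuming $0 \le \beta < 1$ and writing $P = P(\mu^*)$, $\mathbf{r} = \mathbf{r}(\mu^*, \nu^*)$ for brevity.

```latex
\begin{align*}
\mathbf{v}^* &= \mathbf{r} + \beta P \mathbf{v}^*
             = \mathbf{r} + \beta P \mathbf{r} + \beta^2 P^2 \mathbf{v}^*
             = \cdots
             = \sum_{k=0}^{N-1} \beta^k P^k \mathbf{r} + \beta^N P^N \mathbf{v}^*. \\
\intertext{Since $P$ is a stochastic matrix, the entries of $P^N \mathbf{v}^*$ stay bounded, so
$\beta^N P^N \mathbf{v}^* \to 0$ as $N \to \infty$ and}
\mathbf{v}^* &= \sum_{k=0}^{\infty} \beta^k P^k \mathbf{r}
             = (I - \beta P)^{-1} \mathbf{r}
             = \mathbf{v}_\beta(\mu^*, \nu^*).
\end{align*}
```

Every term $\beta^k P^k$ has nonnegative entries, hence so does $(I - \beta P)^{-1}$, and $\mathbf{x} \le \mathbf{y}$ entrywise implies $(I - \beta P)^{-1}\mathbf{x} \le (I - \beta P)^{-1}\mathbf{y}$.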



D. Obtaining the measures 

Solutions to the semidefinite programs (SP) and (SD) provide the moment sequences
corresponding to optimal strategies. Additional computation is required to recover the actual measures.
We briefly describe a classical procedure to recover the measures using linear algebra. For more 
details, the reader may refer to ifTTI . ifTlll . 
Let $\mu \in \mathbb{R}^{2n}$ be a given moment sequence. We wish to find a nonnegative measure supported
on the real line with these moments. The resulting measure will be composed of finitely many
atoms (i.e. a discrete measure) of the form $\sum_i w_i\, \delta(x - a_i)$ where

$$\mathrm{Prob}(x = a_i) = w_i.$$

Construct the following linear system:

$$\begin{bmatrix} \mu_0 & \mu_1 & \cdots & \mu_{n-1} \\ \mu_1 & \mu_2 & \cdots & \mu_n \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{n-1} & \mu_n & \cdots & \mu_{2n-2} \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \end{bmatrix} = - \begin{bmatrix} \mu_n \\ \mu_{n+1} \\ \vdots \\ \mu_{2n-1} \end{bmatrix}.$$

Note that the Hankel matrix that appears on the left hand side is a sub-matrix of $\mathcal{H}(\mu)$. We
assume without loss of generality that the above matrix is strictly positive definite. (Suppose the
above matrix is not full rank; then construct a smaller $k \times k$ linear system of equations by eliminating
the last $n - k$ rows and columns of the matrix so that the $k \times k$ submatrix is full rank, and
therefore strictly positive definite.) By inverting this matrix we solve for $[c_0, \ldots, c_{n-1}]^T$. Let $x_i$
be the roots of the polynomial equation

$$x^n + c_{n-1} x^{n-1} + \cdots + c_1 x + c_0 = 0.$$

It can be shown that the $x_i$ are all real and distinct, and that they are the support points of the
discrete measure. Once the supports are obtained, the weights $w_i$ may be obtained by solving
the nonsingular Vandermonde system given by:

$$\sum_{i=1}^{n} w_i\, x_i^j = \mu_j \qquad (0 \le j \le n - 1).$$
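
The recovery procedure can be traced by hand when the measure has two atoms. The sketch below is restricted to $n = 2$ (an assumption made to keep it dependency-free): it solves the 2x2 Hankel system by Cramer's rule, finds the support from the quadratic formula, and solves the 2x2 Vandermonde system for the weights. The function name is hypothetical, and the test moments are generated from the measure $0.386\,\delta(x) + 0.614\,\delta(x - 1)$.

```python
import math


def recover_two_atoms(m):
    """Recover a two-atom measure from moments [m0, m1, m2, m3]."""
    m0, m1, m2, m3 = m
    det = m0 * m2 - m1 * m1
    if det <= 0.0:
        raise ValueError("Hankel matrix is not positive definite")
    # Solve [[m0, m1], [m1, m2]] [c0, c1]^T = -[m2, m3]^T by Cramer's rule.
    c0 = (m1 * m3 - m2 * m2) / det
    c1 = (m1 * m2 - m0 * m3) / det
    # Support points are the roots of x^2 + c1 x + c0 = 0.
    disc = c1 * c1 - 4.0 * c0
    if disc <= 0.0:
        raise ValueError("support points are not real and distinct")
    x1 = (-c1 - math.sqrt(disc)) / 2.0
    x2 = (-c1 + math.sqrt(disc)) / 2.0
    # Vandermonde system: w1 + w2 = m0 and w1 x1 + w2 x2 = m1.
    w2 = (m1 - m0 * x1) / (x2 - x1)
    w1 = m0 - w2
    return (x1, x2), (w1, w2)


# Moments m_k = 0.386 * 0^k + 0.614 * 1^k for k = 0..3 (0 ** 0 == 1 in Python).
moments = [0.386 * 0.0 ** k + 0.614 * 1.0 ** k for k in range(4)]
support, weights = recover_two_atoms(moments)
```

For general $n$ the same three steps apply, with a linear solver and a polynomial root finder replacing the closed-form 2x2 formulas.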



V. Example 



Consider the two player discounted stochastic game with $\beta = 0.5$, $S = \{1, 2\}$ with payoff
function $r(1, a_1, a_2) = (a_1 - a_2)^2$ and $r(2, a_1, a_2) = -(a_1 - a_2)^2$. Let the probability transition
matrix be given by:

$$P(a_1) = \begin{bmatrix} a_1 & 1 - a_1 \\ 1 - a_1^2 & a_1^2 \end{bmatrix}.$$

Fig. 2. A two state stochastic game with transition probabilities dependent only on the action of Player 1. The payoffs associated
to the states are indicated in the corresponding nodes. The edges are marked by the corresponding state transition probabilities.

Figure 2 graphically illustrates this stochastic game, consisting of two states (the nodes) with
polynomial transition probabilities dependent on $a_1$ (as marked on the edges of the graph). Within
the nodes, the payoffs associated to the corresponding states are indicated.

To understand this game, consider first the zero-sum (nonstochastic) game with payoff function
$p(a_1, a_2) = (a_1 - a_2)^2$ over the strategy space [0, 1]. This game (called the "guessing game")
was studied by Parrilo in [6]. If Player 2 is able to guess the action of Player 1, he can simply
imitate his action (i.e. set $a_2 = a_1$) and his payoff to Player 1 would be zero (this is the minimum
possible since $(a_1 - a_2)^2 \ge 0$). Player 1 would try to confuse Player 2 as much as possible and
thus randomize between the extreme actions $a_1 = 0$ and $a_1 = 1$ with a probability of $\tfrac{1}{2}$ each. Player
2's best response would be to play $a_2 = \tfrac{1}{2}$ with probability 1.
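
The claimed equilibrium of the guessing game can be verified by a short calculation, sketched here for the mixed strategy that puts probability one half on each of $a_1 = 0$ and $a_1 = 1$.

```latex
\begin{align*}
\mathbb{E}\big[(a_1 - a_2)^2\big]
  &= \tfrac{1}{2}\, a_2^2 + \tfrac{1}{2}\, (1 - a_2)^2
   = a_2^2 - a_2 + \tfrac{1}{2}, \\
\intertext{which is minimized over $a_2 \in [0, 1]$ at $a_2 = \tfrac{1}{2}$ with value $\tfrac{1}{4}$. Conversely, against $a_2 = \tfrac{1}{2}$,}
(a_1 - \tfrac{1}{2})^2 &\le \tfrac{1}{4} \quad \text{for all } a_1 \in [0, 1],
  \text{ with equality exactly at } a_1 \in \{0, 1\}.
\end{align*}
```

Neither player can improve by deviating, so the value of the guessing game is $\tfrac{1}{4}$.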

In the game described in Fig. 2, in State 1 Player 1 plays the role of confuser and Player 2 
plays the role of guesser. In state 2, the roles of the players are reversed. Player 1 is the guesser 
and Player 2 the confuser. However, the problem is complicated a bit by the fact that State 1
is advantageous to Player 1, so that at every stage he has an incentive to play a strategy that gives
him a good payoff and also maximizes the chances of transitioning to State 1.

The polynomial optimization problem that computes the minimax strategies and the equilibrium
values is the following:

minimize $v(1) + v(2)$

$v(1) \ge \int (a_1 - a_2)^2\, d\nu(1) + \beta\,\big(a_1 v(1) + (1 - a_1) v(2)\big) \quad \forall a_1 \in [0, 1]$

$v(2) \ge -\int (a_1 - a_2)^2\, d\nu(2) + \beta\,\big((1 - a_1^2) v(1) + a_1^2 v(2)\big) \quad \forall a_1 \in [0, 1]$

$\nu(1), \nu(2)$ probability measures supported on [0, 1].

This problem can be reformulated as follows:

minimize $v(1) + v(2)$

$v(1) \ge a_1^2 - 2 a_1 \nu_1(1) + \nu_2(1) + \beta\,\big(a_1 v(1) + (1 - a_1) v(2)\big) \quad \forall a_1 \in [0, 1]$

$v(2) \ge -a_1^2 + 2 a_1 \nu_1(2) - \nu_2(2) + \beta\,\big((1 - a_1^2) v(1) + a_1^2 v(2)\big) \quad \forall a_1 \in [0, 1]$

$[1, \nu_1(1), \nu_2(1)]^T,\ [1, \nu_1(2), \nu_2(2)]^T \in \mathcal{M}([0, 1])$.

Solving the SDP and its dual we obtain the following optimal cost-to-go and optimal moment
sequences:

$\mathbf{v}^* = [0.298, -0.158]^T$

$\mu^*(1) = [1, 0.614, 0.614]^T$, $\mu^*(2) = [1, 0.5, 0.25]^T$

$\nu^*(1) = [1, 0.614, 0.377]^T$, $\nu^*(2) = [1, 0.614, 0.614]^T$.

The corresponding measures obtained as explained in subsection IV-D are supported at only
finitely many points, and are given by the following:

$\mu^*(1) = 0.386\, \delta(a_1) + 0.614\, \delta(a_1 - 1)$
$\mu^*(2) = \delta(a_1 - 0.5)$

$\nu^*(1) = \delta(a_2 - 0.614)$
$\nu^*(2) = 0.386\, \delta(a_2) + 0.614\, \delta(a_2 - 1)$.

Consider, for example, play in State 1. If Player 1 were playing obliviously with respect to
the state transitions, he would play actions $a_1 = 0$ and $a_1 = 1$ with probability one half each.
However, to increase the probability of staying in State 1 he plays action $a_1 = 1$ with a higher
probability. Player 2 cannot affect the state transition probabilities directly, thus he must play a
myopic best response. (A myopic best response is one that is a best response for the game in the
current state.) Note that in State 1, once Player 1's strategy is fixed, the (only) best response for
Player 2 is to play the action $a_2 = 0.614$ with probability 1. In State 2, Player 1's best strategy
is to play $a_1 = 0.5$. Player 2 picks an action from his myopic best response set (in this case, all
probability distributions that are supported on the points 0 and 1).
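
The reported solution can be sanity-checked against the two polynomial inequalities of the example. The sketch below assumes the reading of the transition probabilities in which State 1 moves to State 1 with probability $a_1$ and State 2 moves to State 2 with probability $a_1^2$, uses the rounded values quoted above, and evaluates the slack of each inequality on a grid of $a_1 \in [0, 1]$; the constant and function names are illustrative.

```python
BETA = 0.5
V = (0.298, -0.158)           # reported value vector v*
NU1 = (1.0, 0.614, 0.377)     # reported moments of nu*(1)
NU2 = (1.0, 0.614, 0.614)     # reported moments of nu*(2)


def rhs_state1(a1):
    """Right-hand side of the State 1 inequality at action a1."""
    stage = a1 ** 2 - 2.0 * a1 * NU1[1] + NU1[2]
    return stage + BETA * (a1 * V[0] + (1.0 - a1) * V[1])


def rhs_state2(a1):
    """Right-hand side of the State 2 inequality at action a1."""
    stage = -a1 ** 2 + 2.0 * a1 * NU2[1] - NU2[2]
    return stage + BETA * ((1.0 - a1 ** 2) * V[0] + a1 ** 2 * V[1])


grid = [k / 1000.0 for k in range(1001)]
slack1 = [V[0] - rhs_state1(a) for a in grid]
slack2 = [V[1] - rhs_state2(a) for a in grid]

# Actions where the inequalities are tight are Player 1's best replies.
tight_state2 = grid[slack2.index(min(slack2))]
```

With these numbers the State 1 slack simplifies to $a_1(1 - a_1)$, which vanishes only at the endpoints, and the State 2 slack to $1.228\,(a_1 - 0.5)^2$, which vanishes only at $a_1 = 0.5$. This matches the supports of $\mu^*(1)$ and $\mu^*(2)$.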

VI. Conclusions and future work 

In this paper, we have presented a technique for solving two-player, zero-sum finite state
stochastic games with infinite strategies and polynomial payoffs. We established the existence of
equilibria for such games. As a by-product we obtained an algorithm that converges to the unique value
vector of the game (however, this algorithm does not seem to have very attractive convergence
rates). We focused mainly on the case where the single-controller assumption holds. We showed
that the problem can be reduced to solving a system of univariate polynomial inequalities and
moment constraints. We used techniques from the classical theory of moments and sums of
squares to reduce the problem to a semidefinite programming problem. By solving a primal-dual
pair of semidefinite programs, we obtained minimax equilibria and optimal strategies for the
players.

It is known that finite-state, finite action, two-player zero-sum games which satisfy the or- 
derfield property lfT3ll . lEl may be solved via linear programming. The single-controller case, 
games with perfect information, switching controller stochastic games, separable reward-state
independent transition (SER-SIT) games and additive games all satisfy this property. We intend
to extend these cases to the infinite strategy case with polynomial payoffs. General finite action
stochastic games which do not satisfy the orderfield property still have an interesting
mathematical structure, but efficient computational procedures are not available. Developing such
procedures presents an interesting direction for future research.

Acknowledgement: The authors would like to thank Ilan Lobel and Prof. Munther Dahleh for 
bringing to their attention the linear programming solution to single controller finite stochastic 
games. 